Python is a modern, open source, object-oriented programming language, created by a Dutch programmer, Guido van Rossum. Officially, it is an interpreted scripting language (meaning that it is not compiled until it is run); the reference implementation is itself coded in C (though there are other non-C implementations). Frequently, it is compared to languages like Perl and Ruby. It offers much of the power and flexibility of lower-level (i.e. compiled) languages, without the steep learning curve, and without most of the associated programming overhead. The language is very clean and readable, and it is available for almost every modern computing platform.
Python offers a number of advantages to scientists, whether they are experienced or novice programmers:
Powerful and easy to use
Python is simultaneously powerful, flexible and easy to learn and use (in general, these qualities are traded off for a given programming language). Anything that can be coded in C, FORTRAN, or Java can be done in Python, almost always in fewer lines of code, and with fewer debugging headaches. Its standard library is extremely rich, including modules for string manipulation, regular expressions, file compression, mathematics, profiling and debugging (to name only a few). Unnecessary language constructs, such as END statements and brackets, are absent, making the code terse, efficient, and easy to read. Finally, Python is object-oriented, an important programming paradigm that is particularly well-suited to scientific programming because it allows data structures to be abstracted in a natural way.
Interactive
Python may be run interactively on the command line, in much the same way as Octave or S-Plus/R. Rather than compiling and running a particular program, commands may be entered serially, each followed by the Return key. This is often useful for mathematical programming and debugging.
Extensible
Python is often referred to as a “glue” language, meaning that it is useful in a mixed-language environment. Frequently, programmers must interact with colleagues who work in other programming languages, or use significant quantities of legacy code that would be problematic or expensive to re-code. Python was designed to interact with other programming languages, and in many cases C or FORTRAN code can be compiled directly into Python programs (using utilities such as f2py or weave). Additionally, since Python is an interpreted language, it can sometimes be slow relative to its compiled cousins. In many cases this performance deficit is due to a short loop of code that runs thousands or millions of times. Such bottlenecks may be removed by coding a function in FORTRAN, C or Cython, and compiling it into a Python module.
Third-party modules
There is a vast body of Python modules created outside the auspices of the Python Software Foundation. These include utilities for database connectivity, mathematics, statistics, and charting/plotting. Some notables include:
pandas: its DataFrame class is useful for spreadsheet-like representation and manipulation of data. It also includes high-level plotting functionality.
Free and open
Python is released on all platforms under an open license (Python Software Foundation License), meaning that the language and its source is freely distributable. Not only does this keep costs down for scientists and universities operating under a limited budget, but it also frees programmers from licensing concerns for any software they may develop. There is little reason to buy expensive licenses for software such as Matlab or Maple, when Python can provide the same functionality for free!
In [1]:
import numpy
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Once we've loaded the library, we can ask it to read our data file for us:
In [2]:
numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
Out[2]:
The expression numpy.loadtxt() is a function call
that asks Python to run the function loadtxt that belongs to the numpy library.
This dotted notation is used everywhere in Python
to refer to the parts of things as thing.component.
numpy.loadtxt has two parameters:
the name of the file we want to read,
and the delimiter that separates values on a line.
These both need to be character strings (or strings for short),
so we put them in quotes.
When we are finished typing and press Shift+Enter,
the notebook runs our command.
Since we haven't told it to do anything else with the function's output,
the notebook displays it.
In this case,
that output is the data we just loaded.
By default,
only a few rows and columns are shown
(with ... to omit elements when displaying big arrays).
To save space,
Python displays numbers as 1. instead of 1.0
when there's nothing interesting after the decimal point.
Our call to numpy.loadtxt read our file,
but didn't save the data in memory.
To do that,
we need to assign the array to a variable.
A variable is just a name for a value,
such as x, current_temperature, or subject_id.
Python's variable names must begin with a letter or an underscore and are case sensitive.
We can create a new variable by assigning a value to it using =.
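The naming rules can be checked directly. The short sketch below uses the string method isidentifier() to test a few made-up names (they are our own examples, not part of the data set) and then shows that names differing only in case refer to separate variables.

```python
# Hypothetical names, chosen only to illustrate the naming rules.
print('weight_kg'.isidentifier())    # starts with a letter: allowed
print('_scratch'.isidentifier())     # starts with an underscore: allowed
print('2nd_reading'.isidentifier())  # starts with a digit: not allowed

# Names are case sensitive, so these are two separate variables.
weight_kg = 55
Weight_kg = 60
print(weight_kg, Weight_kg)
```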
As an illustration,
let's step back and instead of considering a table of data,
consider the simplest "collection" of data,
a single value.
The line below assigns the value 55 to a variable weight_kg:
In [3]:
weight_kg = 55
Once a variable has a value, we can print it to the screen:
In [4]:
weight_kg
Out[4]:
and do arithmetic with it:
In [5]:
print('weight in pounds:', 2.2 * weight_kg)
We can also change a variable's value by assigning it a new one:
In [7]:
weight_kg = 57.5
print('weight in kilograms is now:', weight_kg)
As the example above shows, we can print several things at once by separating them with commas.
If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value:
This means that assigning a value to one variable does not change the values of other variables. For example, let's store the subject's weight in pounds in a variable:
In [8]:
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
and then change weight_kg:
In [9]:
weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)
Since weight_lb doesn't "remember" where its value came from,
it isn't automatically updated when weight_kg changes.
This is different from the way spreadsheets work.
Just as we can assign a single value to a variable, we can also assign an array of values
to a variable using the same syntax. Let's re-run numpy.loadtxt and save its result:
In [10]:
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:
In [11]:
data
Out[11]:
Now that our data is in memory,
we can start doing things with it.
First,
let's ask what type of thing data refers to:
In [12]:
type(data)
Out[12]:
The output tells us that data currently refers to an n-dimensional array created by the NumPy library. These data correspond to arthritis patients' inflammation. The rows are the individual patients and the columns are their daily inflammation measurements.
We can see what its shape is like this:
In [14]:
data.shape
Out[14]:
This tells us that data has 60 rows and 40 columns. When we created the
variable data to store our arthritis data, we didn't just create the array, we also
created information about the array, called attributes.
This extra information describes data in the same way an adjective describes a noun.
data.shape is an attribute of data which describes the dimensions of data.
We use the same dotted notation for the attributes of variables
that we use for the functions in libraries
because they have the same part-and-whole relationship.
If we want to get a single number from the array, we must provide an index in square brackets, just as we do in math:
In [14]:
data[0, 0]
Out[14]:
In [15]:
data[30, 20]
Out[15]:
The expression data[30, 20] may not surprise you,
but data[0, 0] might.
Programming languages like Fortran and MATLAB start counting at 1,
because that's what human beings have done for thousands of years.
Languages in the C family (including C++, Java, Perl, and Python) count from 0
because that's simpler for computers to do.
As a result,
if we have an M×N array in Python,
its indices go from 0 to M-1 on the first axis
and 0 to N-1 on the second.
It takes a bit of getting used to,
but one way to remember the rule is that
the index is how many steps we have to take from the start to get the item we want.
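As a small illustration of these index ranges, the sketch below builds a made-up 3×4 array (not the inflammation data) and reads its first and last elements. The array name grid and its contents are assumptions chosen for the example.

```python
import numpy

# A small made-up array with M = 3 rows and N = 4 columns.
grid = numpy.arange(12).reshape(3, 4)

print(grid.shape)   # (3, 4)
print(grid[0, 0])   # first row, first column: zero steps from the start
print(grid[2, 3])   # last row, last column: indices M-1 and N-1
```

Asking for grid[3, 0] would raise an IndexError, because the first axis only has indices 0, 1 and 2.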
An index like [30, 20] selects a single element of an array,
but we can select whole sections as well.
For example,
we can select the first ten days (columns) of values
for the first four patients (rows) like this:
In [18]:
data[0:4, 0:10]
Out[18]:
The slice 0:4 means,
"Start at index 0 and go up to, but not including, index 4."
Again,
the up-to-but-not-including takes a bit of getting used to,
but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
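The same rule holds for any sequence, so we can check it on a short made-up string where the result is easy to count by hand:

```python
letters = 'abcdefghij'   # ten characters, at indices 0 to 9
chunk = letters[2:7]     # start at index 2, stop before index 7

print(chunk)                 # cdefg
print(len(chunk) == 7 - 2)   # True: upper bound minus lower bound
```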
We don't have to start slices at 0:
In [16]:
print(data[5:10, 0:10])
We also don't have to include the upper and lower bound on the slice. If we don't include the lower bound, Python uses 0 by default; if we don't include the upper, the slice runs to the end of the axis, and if we don't include either (i.e., if we just use ':' on its own), the slice includes everything:
In [17]:
small = data[:3, 36:]
print('small is:')
small
Out[17]:
Arrays also know how to perform common mathematical operations on their values. The simplest operations with data are arithmetic: add, subtract, multiply, and divide. When you do such operations on arrays, the operation is done element-wise on the array. Thus:
In [19]:
doubledata = data * 2.0
will create a new array doubledata
whose elements have the value of two times the value of the corresponding elements in data:
In [23]:
print('original:')
data[:3, 36:]
Out[23]:
In [24]:
print('doubledata:')
doubledata[:3, 36:]
Out[24]:
If, instead of taking an array and doing arithmetic with a single value (as above), you do the arithmetic operation with another array of the same shape, the operation will be done on corresponding elements of the two arrays. Thus:
In [21]:
tripledata = doubledata + data
will give you an array where tripledata[0,0] will equal doubledata[0,0] plus data[0,0],
and so on for all other elements of the arrays.
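Both kinds of element-wise arithmetic are easier to verify by eye on tiny arrays. The sketch below uses two made-up 2×2 arrays (small_a and small_b are our own names, not part of the lesson's data):

```python
import numpy

small_a = numpy.array([[1.0, 2.0], [3.0, 4.0]])
small_b = numpy.array([[10.0, 20.0], [30.0, 40.0]])

doubled = small_a * 2.0      # every element multiplied by the single value
summed = small_a + small_b   # corresponding elements added together

print(doubled)
print(summed)
```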
In [22]:
print('tripledata:')
tripledata[:3, 36:]
Out[22]:
Often, we want to do more than add, subtract, multiply, and divide values of data. Arrays also know how to do more complex operations on their values. If we want to find the average inflammation for all patients on all days, for example, we can just ask the array for its mean value:
In [25]:
data.mean()
Out[25]:
mean is a method of the array.
A method is simply a function that is an attribute of the array,
in the same way that the member shape is.
If variables are nouns, methods are verbs:
they are what the thing in question knows how to do.
We need empty parentheses for data.mean(),
even when we're not passing in any parameters,
to tell Python to go and do something for us.
data.shape doesn't need () because it is just a description,
but data.mean() requires the () because it is an action.
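The difference between a description and an action can be seen on a small made-up array. The name readings is an assumption for this sketch; the last line shows that leaving off the parentheses gives us the method itself rather than its result.

```python
import numpy

readings = numpy.array([[1.0, 2.0], [3.0, 4.0]])

print(readings.shape)            # attribute: no parentheses, prints (2, 2)
print(readings.mean())           # method call: prints 2.5
print(callable(readings.mean))   # True: without (), we only name the method
```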
NumPy arrays have lots of useful methods:
In [26]:
print('maximum inflammation:', data.max())
print('minimum inflammation:', data.min())
print('standard deviation:', data.std())
When analyzing data, though, we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to create a new temporary array of the data we want, then ask it to do the calculation:
In [27]:
patient_0 = data[0, :] # 0 on the first axis, everything on the second
print('maximum inflammation for patient 0:', patient_0.max())
We don't actually need to store the row in a variable of its own. Instead, we can combine the selection and the method call:
In [28]:
print('maximum inflammation for patient 2:', data[2, :].max())
What if we need the maximum inflammation for all patients (as in the next diagram on the left), or the average for each day (as in the diagram on the right)? As the diagram below shows, we want to perform the operation across an axis:
To support this, most array methods allow us to specify the axis we want to be consumed by the operation. If we ask for the average across axis 0 (rows in our 2D example), we get:
In [29]:
data.mean(axis=0)
Out[29]:
As a quick check, we can ask this array what its shape is:
In [30]:
data.mean(axis=0).shape
Out[30]:
The expression (40,) tells us we have an N×1 vector,
so this is the average inflammation per day for all patients.
If we average across axis 1 (columns in our 2D example), we get:
In [31]:
data.mean(axis=1)
Out[31]:
which is the average inflammation per patient across all days.
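Which axis is "consumed" is easier to see on an array small enough to average by hand. The sketch below assumes a made-up table of 2 patients (rows) measured on 3 days (columns); the names toy, per_day and per_patient are ours.

```python
import numpy

# 2 hypothetical patients (rows) measured on 3 days (columns).
toy = numpy.array([[0.0, 2.0, 4.0],
                   [2.0, 4.0, 6.0]])

per_day = toy.mean(axis=0)       # averages down the rows: one value per day
per_patient = toy.mean(axis=1)   # averages along the columns: one value per patient

print(per_day, per_day.shape)
print(per_patient, per_patient.shape)
```

The result has one entry for each position along the axis that was not averaged away: three per-day means and two per-patient means.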
The mathematician Richard Hamming once said,
"The purpose of computing is insight, not numbers,"
and the best way to develop insight is often to visualize data.
Visualization deserves an entire lecture (or course) of its own,
but we can explore a few features of Python's matplotlib library here.
While there is no "official" plotting library,
this package is the de facto standard.
First,
we will import the pyplot module from matplotlib
and use two of its functions to create and display a heat map of our data:
In [32]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(data)
Out[32]:
Blue regions in this heat map are low values, while red shows high values. As we can see, inflammation rises and falls over a 40-day period.
Some IPython magic
If you're using an IPython / Jupyter notebook, you'll need to execute the following command in order for your matplotlib images to appear in the notebook:
%matplotlib inline
The % indicates an IPython magic function - a function that is only valid within the notebook environment. Note that you only have to execute this function once per notebook.
Let's take a look at the average inflammation over time:
In [34]:
ave_inflammation = data.mean(axis=0)
plt.plot(ave_inflammation)
Out[34]:
Here,
we have put the average per day across all patients in the variable ave_inflammation,
then asked matplotlib.pyplot to create and display a line graph of those values.
The result is roughly a linear rise and fall,
which is suspicious:
based on other studies,
we expect a sharper rise and slower fall.
Let's have a look at two other statistics:
In [34]:
plt.plot(data.max(axis=0))
Out[34]:
In [35]:
plt.plot(data.min(axis=0))
Out[35]:
The maximum value rises and falls perfectly smoothly, while the minimum seems to be a step function. Neither result seems particularly likely, so either there's a mistake in our calculations or something is wrong with our data.
You can group similar plots in a single figure using subplots. The script below uses a number of new commands. The function figure() creates a space into which we will place all of our plots. The parameter figsize tells Python how big to make this space. Each subplot is placed into the figure using the add_subplot method. The add_subplot method takes 3 parameters. The first denotes how many total rows of subplots there are, the second parameter refers to the total number of subplot columns, and the final parameter denotes which subplot your variable is referencing. Each subplot is stored in a different variable (axes1, axes2, axes3). Once a subplot is created, the axes can be labelled using the set_xlabel() command (or set_ylabel()). Here are our three plots side by side:
In [35]:
fig = plt.figure(figsize=(10.0, 3.0))
axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)
axes1.set_ylabel('average')
axes1.plot(data.mean(axis=0))
axes2.set_ylabel('max')
axes2.plot(data.max(axis=0))
axes3.set_ylabel('min')
axes3.plot(data.min(axis=0))
fig.tight_layout()
The call to loadtxt reads our data,
and the rest of the program tells the plotting library
how large we want the figure to be,
that we're creating three sub-plots,
what to draw for each one,
and that we want a tight layout.
(Perversely,
if we leave out that call to fig.tight_layout(),
the graphs will actually be squeezed together more closely.)
In [36]:
element = 'oxygen'
print('first three characters:', element[:3])
print('last three characters:', element[3:6])
What is the value of element[:4]?
What about element[4:]?
Or element[:]?
What is element[-1]?
What is element[-2]?
Given those answers,
explain what element[1:-1] does.
In [ ]:
Above, we wrote some code that plots some values of interest from our first inflammation dataset, and reveals some suspicious features in it.
We have a dozen data sets right now, though, and more on the way. We want to create plots for all of our data sets with a single statement. To do that, we'll have to teach the computer how to repeat things.
An example task that we might want to repeat is printing each character in a
word on a line of its own. One way to do this would be to use a series of print statements:
In [38]:
word = 'lead'
print(word[0])
print(word[1])
print(word[2])
print(word[3])
This is a bad approach for two reasons:
It doesn't scale: if we want to print the characters in a string that's hundreds of letters long, we'd be better off just typing them in.
It's fragile: if we give it a longer string, it only prints part of the data, and if we give it a shorter one, it produces an error because we're asking for characters that don't exist.
In [39]:
word = 'tin'
print(word[0])
print(word[1])
print(word[2])
print(word[3])
Here's a better approach:
In [40]:
word = 'lead'
for char in word:
    print(char)
This is shorter---certainly shorter than something that prints every character in a hundred-letter string---and more robust as well:
In [41]:
word = 'oxygen'
for char in word:
    print(char)
The improved version uses a for loop
to repeat an operation---in this case, printing---once for each thing in a collection.
The general form of a loop is:
for variable in collection:
    do things with variable
We can call the loop variable anything we like, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. Unlike many other languages, there is no command to end a loop (e.g. end for); what is indented after the for statement belongs to the loop.
Here's another loop that repeatedly updates a variable:
In [42]:
length = 0
for vowel in 'aeiou':
    length = length + 1
print('There are', length, 'vowels')
It's worth tracing the execution of this little program step by step.
Since there are five characters in 'aeiou',
the statement on line 3 will be executed five times.
The first time around,
length is zero (the value assigned to it on line 1)
and vowel is 'a'.
The statement adds 1 to the old value of length,
producing 1,
and updates length to refer to that new value.
The next time around,
vowel is 'e' and length is 1,
so length is updated to be 2.
After three more updates,
length is 5;
since there is nothing left in 'aeiou' for Python to process,
the loop finishes
and the print statement on line 4 tells us our final answer.
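One way to follow this trace without doing it by hand is to add a print call inside the loop. The sketch below is the same program with one extra line; the wording of the printed message is our own.

```python
length = 0
for vowel in 'aeiou':
    length = length + 1
    # Extra line added only for tracing: show the state after each update.
    print('vowel is', vowel, 'and length is now', length)
print('There are', length, 'vowels')
```

Each pass through the loop prints one line, so we see five trace lines followed by the final answer.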
Note that a loop variable is just a variable that's being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:
In [43]:
letter = 'z'
for letter in 'abc':
    print(letter)
print('after the loop, letter is', letter)
Note also that finding the length of a string is such a common operation
that Python actually has a built-in function to do it called len:
In [44]:
len('aeiou')
Out[44]:
len is much faster than any function we could write ourselves,
and much easier to read than a two-line loop;
it will also give us the length of many other things that we haven't met yet,
so we should always use it when we can.
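As a quick sketch of that last point, len accepts several of the collection types that appear later in this lesson (lists, tuples and dictionaries). The values here are made up for the example.

```python
print(len('aeiou'))                   # characters in a string: 5
print(len([1, 3, 5, 7]))              # elements in a list: 4
print(len((15,)))                     # elements in a tuple: 1
print(len({'a': 16, 'b': (4, 5)}))    # keys in a dictionary: 2
```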
Exponentiation is built into Python:
In [45]:
5**3
Out[45]:
Write a loop that calculates the same result as 5 ** 3 using
multiplication (and without exponentiation).
In [ ]:
In [ ]:
In [49]:
odds = [1, 3, 5, 7]
print('odds are:', odds)
We select individual elements from lists by indexing them:
In [50]:
print('first and last:', odds[0], odds[-1])
and if we loop over a list, the loop variable is assigned elements one at a time:
In [51]:
for number in odds:
    print(number)
There is one important difference between lists and strings: we can change the values in a list, but we cannot change the characters in a string. For example:
In [52]:
names = ['Newton', 'Darwing', 'Turing'] # typo in Darwin's name
print('names is originally:', names)
names[1] = 'Darwin' # correct the name
print('final value of names:', names)
works, but:
In [53]:
name = 'Bell'
name[0] = 'b'
does not.
Ch-Ch-Ch-Changes
Data which can be modified in place is called mutable, while data which cannot be modified is called immutable. Strings and numbers are immutable. This does not mean that variables with string or number values are constants, but when we want to change the value of a string or number variable, we can only replace the old value with a completely new value.
Lists and arrays, on the other hand, are mutable: we can modify them after they have been created. We can change individual elements, append new elements, or reorder the whole list. For some operations, like sorting, we can choose whether to use a function that modifies the data in place or a function that returns a modified copy and leaves the original unchanged.
Be careful when modifying data in place. If two variables refer to the same list, and you modify the list value, it will change for both variables! If you want variables with mutable values to be independent, you must make a copy of the value when you assign it.
Because of pitfalls like this, code which modifies data in place can be more difficult to understand. However, it is often far more efficient to modify a large data structure in place than to create a modified copy for every small change. You should consider both of these aspects when writing your code.
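Sorting is a convenient place to see the two styles side by side. In the sketch below (the list readings is made up), the built-in function sorted returns a new list, while the list method sort rearranges the existing list in place.

```python
readings = [3, 1, 2]

ordered_copy = sorted(readings)          # returns a new, sorted list
untouched = (readings == [3, 1, 2])      # the original is still in its old order
print(readings, ordered_copy, untouched)

readings.sort()                          # sorts the existing list in place
print(readings)
```

Note that readings.sort() returns None, so writing readings = readings.sort() would throw the list away.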
There are many ways to change the contents of lists besides assigning new values to individual elements:
In [57]:
odds.append(11)
print('odds after adding a value:', odds)
In [58]:
del odds[0]
print('odds after removing the first element:', odds)
In [59]:
odds.reverse()
print('odds after reversing:', odds)
While modifying in place, it is useful to remember that Python treats lists in a slightly counterintuitive way.
If we make a list and (attempt to) copy it then modify in place, we can cause all sorts of trouble:
In [60]:
odds = [1, 3, 5, 7]
primes = odds
primes += [2]
print('primes:', primes)
print('odds:', odds)
This is because Python stores a list in memory, and then can use multiple names to refer to the same list. If all we want to do is copy a (simple) list, we can slice all of its values into a new list, so we do not modify a list we did not mean to:
In [61]:
odds = [1, 3, 5, 7]
# remember what this does!
primes = odds[:]
primes += [2]
print('primes:', primes)
print('odds:', odds)
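The word "simple" matters here. A slice copies only the outer list, so if the elements are themselves lists, the copy still shares them with the original. The sketch below, using a made-up list of lists, contrasts a slice copy with copy.deepcopy from the standard library, which is one way (our suggestion, not the lesson's) to get a fully independent copy.

```python
import copy

table = [[1, 3], [5, 7]]        # a list whose elements are lists

shallow = table[:]              # new outer list, same inner lists
deep = copy.deepcopy(table)     # new outer list and new inner lists

shallow[0].append(99)           # changes the inner list shared with table

print('table:', table)
print('shallow:', shallow)
print('deep:', deep)
```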
In [62]:
["h", "e", "l", "l", "o"]
Out[62]:
Hint: You can create an empty list like this:
In [63]:
my_list = []
In [64]:
(34,90,56) # Tuple with three elements
Out[64]:
In [65]:
(15,) # Tuple with one element
Out[65]:
In [66]:
(12, 'foobar') # Mixed tuple
Out[66]:
As with lists, individual elements in a tuple can be accessed by indexing.
In [68]:
foo = (5, 7, 2, 8, 2, -1, 0, 4)
foo[4]
Out[68]:
The tuple function can be used to cast any sequence into a tuple:
In [69]:
tuple('foobar')
Out[69]:
One of the more flexible built-in data structures is the dictionary. A dictionary maps a set of keys to their associated values. These mappings are mutable and, unlike lists or tuples, are not indexed by position (since Python 3.7 a dictionary does remember the order in which its keys were inserted, but older versions make no such promise). Hence, rather than using the sequence index to return elements of the collection, the corresponding key must be used. Dictionaries are specified by a comma-separated sequence of key and value pairs, with each key separated from its value by a colon. The dictionary is enclosed by curly braces.
For example:
In [70]:
my_dict = {'a':16,
'b':(4,5),
'foo':'''(noun) a term used as a universal substitute
for something real, especially when discussing technological ideas and
problems'''}
my_dict
Out[70]:
In [71]:
my_dict['b']
Out[71]:
Notice that a indexes an integer, b a tuple, and foo a string. Hence, a dictionary is a sort of associative array. Some languages refer to such a structure as a hash or key-value store.
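Because dictionaries are mutable, we can add, replace and delete entries by key after the dictionary has been created. A minimal sketch, using a separate made-up dictionary called sample so that my_dict is left alone:

```python
sample = {'a': 16, 'b': (4, 5)}

sample['c'] = 'new entry'   # a new key adds an entry
sample['a'] = 17            # an existing key has its value replaced
del sample['b']             # del removes the entry for a key

print(sample)
```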
Like lists, dictionaries are mutable and come with a variety of methods, and several built-in functions accept dictionary arguments. For example:
In [72]:
len(my_dict)
Out[72]:
We can also check an object for membership in a dictionary using the in expression:
In [74]:
'a' in my_dict
Out[74]:
Some useful dictionary methods are:
In [75]:
# Returns a view of the key/value pairs (a list in Python 2)
my_dict.items()
Out[75]:
In [76]:
# Returns a view of the keys (a list in Python 2)
my_dict.keys()
Out[76]:
In [77]:
# Returns a view of the values (a list in Python 2)
my_dict.values()
Out[77]:
When we try to index a key that does not exist, it raises a KeyError.
In [78]:
my_dict['c']
If we would rather not get the error, we can use the get method, which returns None if the key is not present.
In [79]:
my_dict.get('c')
Custom return values can be specified with a second argument.
In [80]:
my_dict.get('c', -1)
Out[80]:
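A common use of the second argument is counting, where a missing key should be treated as a count of zero. The sketch below tallies the letters of a made-up word; the dictionary name counts is our own.

```python
counts = {}
for letter in 'hello':
    # A letter we have not seen yet starts from 0 instead of raising a KeyError.
    counts[letter] = counts.get(letter, 0) + 1

print(counts)
```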
It is easy to remove items from a dictionary.
In [81]:
my_dict.popitem()
Out[81]:
In [82]:
my_dict
Out[82]:
In [83]:
my_dict.clear()
In [84]:
my_dict
Out[84]:
In [85]:
import glob
The glob library provides a function, also called glob,
that finds files whose names match a pattern.
We provide those patterns as strings:
the character * matches zero or more characters,
while ? matches any one character.
We can use this to get the names of all the HTML files in the current directory:
In [87]:
from glob import glob
glob('*.html')
Out[87]:
As these examples show,
glob.glob's result is a list of strings,
which means we can loop over it
to do something with each filename in turn.
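The wildcard rules can be tried out without any files on disk. The sketch below applies fnmatch.filter, the standard library's shell-style matcher, to a made-up list of names; all of the file names here are hypothetical. It also sorts the matches, which is often worth doing with real glob results because their order is not guaranteed.

```python
import fnmatch

# Hypothetical file names, used only to show how the patterns behave.
candidate_names = ['inflammation-02.csv', 'inflammation-01.csv',
                   'small-01.csv', 'inflammation-10.txt']

# Each ? must match exactly one character; the .csv ending must match literally.
matches = sorted(fnmatch.filter(candidate_names, 'inflammation-??.csv'))

for name in matches:
    print(name)
```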
In our case,
the "something" we want to do is generate a set of plots for each file in our inflammation dataset.
Let's test it by analyzing the first three files in the list:
In [88]:
filenames = glob('data/inflammation*.csv')
filenames = filenames[0:3]
for f in filenames:
    print(f)
    data = numpy.loadtxt(fname=f, delimiter=',')
    fig = plt.figure(figsize=(10.0, 3.0))
    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)
    axes1.set_ylabel('average')
    axes1.plot(data.mean(axis=0))
    axes2.set_ylabel('max')
    axes2.plot(data.max(axis=0))
    axes3.set_ylabel('min')
    axes3.plot(data.min(axis=0))
    fig.tight_layout()
Sure enough, the maxima of the first two data sets show exactly the same ramp as the first, and their minima show the same staircase structure; a different situation has been revealed in the third dataset, where the maxima are a bit less regular, but the minima are consistently zero.
We can ask Python to take different actions, depending on a condition, with an if statement:
In [89]:
num = 37
if num > 100:
    print('greater')
else:
    print('not greater')
print('done')
The second line of this code uses the keyword if
to tell Python that we want to make a choice.
If the test that follows the if statement is true,
the body of the if
(i.e., the lines indented underneath it) is executed.
If the test is false,
the body of the else is executed instead.
Only one or the other is ever executed.
Conditional statements don't have to include an else.
If there isn't one,
Python simply does nothing if the test is false:
In [90]:
num = 53
print('before conditional...')
if num > 100:
    print('53 is greater than 100')
print('...after conditional')
We can also chain several tests together using elif,
which is short for "else if".
The following Python code uses elif to print the sign of a number.
In [91]:
num = -3
if num > 0:
    print(num, "is positive")
elif num == 0:
    print(num, "is zero")
else:
    print(num, "is negative")
One important thing to notice in the code above is that we use a double equals sign ==
to test for equality
rather than a single equals sign
because the latter is used to mean assignment.
We can also combine tests using and and or.
and is only true if both parts are true:
In [94]:
if (1 > 0) and (-1 > 0):
    print('both parts are true')
else:
    print('at least one part is not true')
while or is true if at least one part is true:
In [93]:
if (1 < 0) or (-1 < 0):
    print('at least one test is true')
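To see every combination at once, we can loop over both truth values for each side and record what and and or produce. This is only a sketch; the names left, right and rows are ours.

```python
rows = []
for left in (True, False):
    for right in (True, False):
        # Store: left, right, result of and, result of or.
        rows.append((left, right, left and right, left or right))
        print(left, right, '| and:', left and right, '| or:', left or right)
```

Only the first row has and equal to True, and only the last row has or equal to False.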
Now that we've seen how conditionals work,
we can use them to check for the suspicious features we saw in our inflammation data.
In the first couple of plots, the maximum inflammation per day
seemed to rise like a straight line, one unit per day.
We can check for this inside the for loop we wrote with the following conditional:
if data.max(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
    print('Suspicious looking maxima!')
We also saw a different problem in the third dataset;
the minima per day were all zero (looks like a healthy person snuck into our study).
We can also check for this with an elif condition:
elif data.min(axis=0).sum() == 0:
    print('Minima add up to zero!')
And if neither of these conditions is true, we can use else to give the all-clear:
else:
    print('Seems OK!')
Let's test that out:
In [95]:
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
if data.max(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
    print('Suspicious looking maxima!')
elif data.min(axis=0).sum() == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')
In [96]:
data = numpy.loadtxt(fname='data/inflammation-03.csv', delimiter=',')
if data.max(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
    print('Suspicious looking maxima!')
elif data.min(axis=0).sum() == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')
In this way,
we have asked Python to do something different depending on the condition of our data.
Here we printed messages in all cases,
but we could also imagine not using the else catch-all
so that messages are only printed when something is wrong,
freeing us from having to manually examine every plot for features we've seen before.
True and False are special words in Python called booleans
which represent true and false statements. However, they aren't the only values in Python that are true and false.
In fact, any value can be used in an if or elif.
After reading and running the code below,
explain what the rule is for which values are considered true and which are considered false.
In [68]:
if '':
    print('empty string is true')
if 'word':
    print('word is true')
if []:
    print('empty list is true')
if [1, 2, 3]:
    print('non-empty list is true')
if 0:
    print('zero is true')
if 1:
    print('one is true')
In [97]:
x = 1 # original value
x += 1 # add one to x, assigning result back to x
x *= 3 # multiply x by 3
x
Out[97]:
At this point, we've written code to draw some interesting features in our inflammation data, loop over all our data files to quickly draw these plots for each of them, and have Python make decisions based on what it sees in our data. But our code is getting pretty long and complicated; what if we had thousands of datasets, and didn't want to generate a figure for every single one? Commenting out the figure-drawing code is a nuisance. Also, what if we want to use that code again, on a different dataset or at a different point in our program? Cutting and pasting it is going to make our code get very long and very repetitive, very quickly. We'd like a way to package our code so that it is easier to reuse, and Python provides for this by letting us define things called functions - a shorthand way of re-executing longer pieces of code.
Let's start by defining a function fahr_to_kelvin
that converts temperatures from Fahrenheit to Kelvin:
In [99]:
def fahr_to_kelvin(temp):
    return ((temp - 32) * (5/9)) + 273.15
The function definition opens with the word def,
which is followed by the name of the function
and a parenthesized list of parameter names.
The body of the function --- the
statements that are executed when it runs --- is indented below the definition line,
typically by four spaces.
When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.
Let's try running our function. Calling our own function is no different from calling any other function:
In [100]:
print('freezing point of water:', fahr_to_kelvin(32))
print('boiling point of water:', fahr_to_kelvin(212))
We've successfully called the function that we defined, and we have access to the value that we returned.
Integer division
We are using Python 3, where division always returns a floating point number:
$ python3 -c "print(5/9)"
0.5555555555555556
Unfortunately, this wasn't the case in Python 2:
>>> 5/9
0
If you are using Python 2 and want to keep the fractional part of division you need to convert one or the other number to floating point:
>>> 5.0/9
0.555555555556
>>> 5/9.0
0.555555555556
And if you want an integer result from division in Python 3, use a double-slash:
>>> 3//2
1
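The difference between the two division operators is easiest to see side by side. The following is a small sketch for Python 3 only; the variable names are invented for the illustration. Note that the double-slash rounds down rather than simply dropping the fractional part, which matters for negative numbers.

```python
# True division always produces a float in Python 3.
true_quotient = 7 / 2        # 3.5

# Floor division rounds down to the nearest whole number.
floor_quotient = 7 // 2      # 3
negative_floor = -7 // 2     # -4, because -3.5 rounds down

# int() drops the fractional part instead, which differs for negatives.
truncated = int(-7 / 2)      # -3

# divmod returns the floor quotient and the remainder together.
quotient, remainder = divmod(7, 2)

print(true_quotient, floor_quotient, negative_floor, truncated, quotient, remainder)
```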
Now that we've seen how to turn Fahrenheit into Kelvin, it's easy to turn Kelvin into Celsius:
In [101]:
def kelvin_to_celsius(temp):
    return temp - 273.15
print('absolute zero in Celsius:', kelvin_to_celsius(0.0))
What about converting Fahrenheit to Celsius? We could write out the formula, but we don't need to. Instead, we can compose the required function, based on the two functions we have already created:
In [102]:
def fahr_to_celsius(temp):
    temp_k = fahr_to_kelvin(temp)
    result = kelvin_to_celsius(temp_k)
    return result
print('freezing point of water in Celsius:', fahr_to_celsius(32.0))
This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here --- typically half a dozen to a few dozen lines --- but they shouldn't ever be much longer than that, or the next person who reads them won't be able to understand what's going on.
Now that we know how to wrap bits of code up in functions,
we can make our inflammation analysis easier to read and easier to reuse.
First, let's make an analyze
function that generates our plots:
In [107]:
def analyze(filename):
    data = numpy.loadtxt(fname=filename, delimiter=',')
    fig = plt.figure(figsize=(10.0, 3.0))
    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)
    axes1.set_ylabel('average')
    axes1.plot(data.mean(axis=0))
    axes2.set_ylabel('max')
    axes2.plot(data.max(axis=0))
    axes2.set_title(filename[:-4])
    axes3.set_ylabel('min')
    axes3.plot(data.min(axis=0))
    fig.tight_layout()
and another function called detect_problems
that checks for those systematics
we noticed:
In [108]:
def detect_problems(filename):
    data = numpy.loadtxt(fname=filename, delimiter=',')
    if data.max(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
        print('Suspicious looking maxima!')
    elif data.min(axis=0).sum() == 0:
        print('Minima add up to zero!')
    else:
        print('Seems OK!')
Notice that rather than jumbling this code together in one giant for
loop,
we can now read and reuse both ideas separately.
We can reproduce the previous analysis with a much simpler for
loop:
In [109]:
for f in filenames[:3]:
    print('\nOpening file', f)
    analyze(f)
    detect_problems(f)
By giving our functions human-readable names,
we can more easily read and understand what is happening in the for
loop.
Even better, if at some later date we want to use either of those pieces of code again,
we can do so in a single line.
Once we start putting things in functions so that we can re-use them, we need to start testing that those functions are working correctly. To see how to do this, let's write a function to center a dataset around a particular value:
In [110]:
def center(data, desired):
    return (data - data.mean()) + desired
We could test this on our actual data, but since we don't know what the values ought to be, it will be hard to tell if the result was correct. Instead, let's use NumPy to create a matrix of 0's and then center that around 3:
In [111]:
z = numpy.zeros((2,2))
print(center(z, 3))
That looks right,
so let's try center
on our real data:
In [112]:
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
print(center(data, 0))
It's hard to tell from the default output whether the result is correct, but there are a few simple tests that will reassure us:
In [113]:
print('original min, mean, and max are:', data.min(), data.mean(), data.max())
centered = center(data, 0)
print('min, mean, and max of centered data are:', centered.min(), centered.mean(), centered.max())
That seems almost right: the original mean was about 6.1, so the lower bound from zero is now about -6.1. The mean of the centered data isn't quite zero --- we'll explore why not in the challenges --- but it's pretty close. We can even go further and check that the standard deviation hasn't changed:
In [114]:
print('std dev before and after:', data.std(), centered.std())
Those values look the same, but we probably wouldn't notice if they were different in the sixth decimal place. Let's do this instead:
In [115]:
print('difference in standard deviations before and after:', data.std() - centered.std())
Again, the difference is very small. It's still possible that our function is wrong, but it seems unlikely enough that we should probably get back to doing our analysis. We have one more task first, though: we should write some documentation for our function to remind ourselves later what it's for and how to use it.
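Before moving on to documentation, note that checks like the ones above can be written down so that Python does the comparing for us. The sketch below is only an illustration: center_list is a hypothetical plain-list stand-in for center, used so that the example runs with the standard library alone, and the tolerances are arbitrary choices.

```python
import math
from statistics import mean, pstdev

def center_list(values, desired=0.0):
    """Plain-list stand-in for center(): shift values so their mean is `desired`."""
    current_mean = mean(values)
    return [(value - current_mean) + desired for value in values]

sample = [1.0, 2.0, 3.0, 6.0]
centered = center_list(sample, 0.0)

# The mean should land on the desired value, up to floating point error.
mean_ok = math.isclose(mean(centered), 0.0, abs_tol=1e-12)

# Shifting the data should leave the spread unchanged.
spread_ok = math.isclose(pstdev(sample), pstdev(centered), rel_tol=1e-12)

print('mean check passed:', mean_ok)
print('spread check passed:', spread_ok)
```

Comparing with a tolerance instead of == avoids false alarms from tiny rounding differences like the ones seen above.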
The usual way to put documentation in software is to add comments like this:
In [116]:
# center(data, desired): return a new array containing the original data centered around the desired value.
def center(data, desired):
    return (data - data.mean()) + desired
There's a better way, though. If the first thing in a function is a string that isn't assigned to a variable, that string is attached to the function as its documentation:
In [117]:
def center(data, desired):
    '''Return a new array containing the original data centered around the desired value.'''
    return (data - data.mean()) + desired
This is better because we can now ask Python's built-in help system to show us the documentation for the function:
In [118]:
help(center)
A string like this is called a docstring. We don't need to use triple quotes when we write one, but if we do, we can break the string across multiple lines:
In [119]:
def center(data, desired):
    '''Return a new array containing the original data centered around the desired value.
    Example: center([1, 2, 3], 0) => [-1, 0, 1]'''
    return (data - data.mean()) + desired
help(center)
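The docstring is stored on the function itself, so it can also be read from code. The following sketch assumes a hypothetical plain-list function named center_list, used here instead of center so that the example needs no extra packages.

```python
import inspect

def center_list(values, desired=0.0):
    '''Return a new list with the values centered around the desired value.
    Example: center_list([1, 2, 3], 0) => [-1.0, 0.0, 1.0]'''
    current_mean = sum(values) / len(values)
    return [(value - current_mean) + desired for value in values]

# The raw docstring is available as the __doc__ attribute.
raw_doc = center_list.__doc__

# inspect.getdoc returns the same text with the indentation cleaned up.
clean_doc = inspect.getdoc(center_list)
summary_line = clean_doc.splitlines()[0]

print(summary_line)
```

This is the same text that help displays for the function.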
We can pass the filename to numpy.loadtxt without writing fname=:
In [120]:
numpy.loadtxt('data/inflammation-01.csv', delimiter=',')
Out[120]:
but we still need to say delimiter=:
In [121]:
numpy.loadtxt('data/inflammation-01.csv', ',')
To understand what's going on,
and make our own functions easier to use,
let's re-define our center
function like this:
In [122]:
def center(data, desired=0.0):
    '''Return a new array containing the original data centered around the desired value (0 by default).
    Example: center([1, 2, 3], 0) => [-1, 0, 1]'''
    return (data - data.mean()) + desired
The key change is that the second parameter is now written desired=0.0
instead of just desired.
If we call the function with two arguments,
it works as it did before:
In [123]:
test_data = numpy.zeros((2, 2))
print(center(test_data, 3))
But we can also now call it with just one parameter,
in which case desired
is automatically assigned the default value of 0.0:
In [124]:
more_data = 5 + numpy.zeros((2, 2))
print('data before centering:')
print(more_data)
print('centered data:')
print(center(more_data))
This is handy: if we usually want a function to work one way, but occasionally need it to do something else, we can allow people to pass a parameter when they need to but provide a default to make the normal case easier. The example below shows how Python matches values to parameters:
In [125]:
def display(a=1, b=2, c=3):
    print('a:', a, 'b:', b, 'c:', c)
print('no parameters:')
display()
print('one parameter:')
display(55)
print('two parameters:')
display(55, 66)
As this example shows, parameters are matched up from left to right, and any that haven't been given a value explicitly get their default value. We can override this behavior by naming the value as we pass it in:
In [126]:
print('only setting the value of c')
display(c=77)
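Positional and named values can also be combined in one call. The sketch below uses describe, a hypothetical variant of display that returns its text instead of printing it, so that the result can be inspected.

```python
def describe(a=1, b=2, c=3):
    """Hypothetical variant of display() that returns the text instead of printing it."""
    return 'a: {} b: {} c: {}'.format(a, b, c)

# The first value is matched to `a` by position, `c` is set by name,
# and `b` keeps its default.
mixed = describe(55, c=77)

# Named values may be given in any order.
reordered = describe(c=3.5, a=10)

print(mixed)
print(reordered)
```

Positional values have to come before named ones: writing describe(c=77, 55) is rejected by Python as a syntax error.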
With that in hand,
let's look at the help for numpy.loadtxt:
In [127]:
help(numpy.loadtxt)
There's a lot of information here, but the most important part is the first couple of lines:
loadtxt(fname, dtype=<type 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None,
unpack=False, ndmin=0)
This tells us that loadtxt
has one parameter called fname
that doesn't have a default value,
and eight others that do.
If we call the function like this:
In [128]:
numpy.loadtxt('data/inflammation-01.csv', ',')
then the filename is assigned to fname
(which is what we want),
but the delimiter string ','
is assigned to dtype
rather than delimiter,
because dtype
is the second parameter in the list. However, ',' isn't a known dtype,
so our code produced an error message when we tried to run it.
When we call loadtxt
we don't have to provide fname=
for the filename because it's the
first item in the list, but if we want the ',' to be assigned to the variable delimiter,
we do have to provide delimiter=
for the second parameter since delimiter
is not the second parameter in the list.
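The same matching can be observed without loading any data. The sketch below defines load_stub, a hypothetical stand-in that copies only the first few parameter names from the loadtxt signature shown above and reports where each value ended up; it does not read files and is not how NumPy itself is implemented.

```python
def load_stub(fname, dtype=float, comments='#', delimiter=None):
    """Hypothetical stand-in that only reports how its arguments were bound."""
    return {'fname': fname, 'dtype': dtype,
            'comments': comments, 'delimiter': delimiter}

# Passed by position, the ',' lands in the second parameter, dtype.
positional = load_stub('data/inflammation-01.csv', ',')

# Passed by name, the ',' goes where we intended.
named = load_stub('data/inflammation-01.csv', delimiter=',')

print('positional call: dtype =', repr(positional['dtype']),
      'delimiter =', repr(positional['delimiter']))
print('named call:      dtype =', repr(named['dtype']),
      'delimiter =', repr(named['delimiter']))
```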
"Adding" two strings produces their concatenation:
'a' + 'b'
is 'ab'.
Write a function called fence
that takes two parameters called original
and wrapper
and returns a new string that has the wrapper character at the beginning and end of the original.
A call to your function should look like this:
print(fence('name', '*'))
*name*
In [129]:
f = 0
k = 0
def f2k(f):
    k = ((f-32)*(5.0/9.0)) + 273.15
    return k
f2k(8)
f2k(41)
f2k(32)
print(k)
Every programmer encounters errors, both those who are just beginning, and those who have been programming for years. Encountering errors and exceptions can be very frustrating at times, and can make coding feel like a hopeless endeavour. However, understanding what the different types of errors are and when you are likely to encounter them can help a lot. Once you know why you get certain types of errors, they become much easier to fix.
Errors in Python have a very specific form, called a traceback.
Let's examine one:
In [130]:
import errors_01
errors_01.favorite_ice_cream()
This particular traceback has two levels. You can determine the number of levels by looking for the number of arrows on the left hand side. In this case:
The first shows code from the cell above, with an arrow pointing to Line 2 (which is errors_01.favorite_ice_cream()).
The second shows some code in another function (favorite_ice_cream, located in the file errors_01.py), with an arrow pointing to Line 7 (which is print(ice_creams[3])).
The last level is the actual place where the error occurred.
The other level(s) show what function the program executed to get to the next level down.
So, in this case, the program first called the function favorite_ice_cream.
Inside this function, the program encountered an error on Line 7, when it tried to run the code print(ice_creams[3]).
Long Tracebacks
Sometimes, you might see a traceback that is very long -- sometimes they might even be 20 levels deep! This can make it seem like something horrible happened, but really it just means that your program called many functions before it ran into the error. Most of the time, you can just pay attention to the bottom-most level, which is the actual place where the error occurred.
So what error did the program actually encounter?
In the last line of the traceback,
Python helpfully tells us the category or type of error (in this case, it is an IndexError)
and a more detailed error message (in this case, it says "list index out of range").
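The pieces of a traceback can also be inspected from code using the standard traceback module. In the sketch below, the body of favorite_ice_cream is an assumption about what errors_01.py contains (a three-item list indexed at position 3), written out here so that the example is self-contained.

```python
import traceback

def favorite_ice_cream():
    # Assumed contents: three flavours, so valid indices are 0, 1 and 2.
    ice_creams = ['chocolate', 'vanilla', 'strawberry']
    print(ice_creams[3])   # index 3 does not exist

try:
    favorite_ice_cream()
except IndexError as err:
    error_type = type(err).__name__    # the category of error
    message = str(err)                 # the more detailed message
    # One entry per level: the call in this code, then the line in the function.
    levels = traceback.extract_tb(err.__traceback__)
    level_names = [level.name for level in levels]

print(error_type + ':', message)
print('number of levels:', len(levels))
print('error occurred in:', level_names[-1])
```

The last entry in levels corresponds to the bottom-most level of the printed traceback, which is where the error actually occurred.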
If you encounter an error and don't know what it means, it is still important to read the traceback closely. That way, if you fix the error, but encounter a new one, you can tell that the error changed. Additionally, sometimes just knowing where the error occurred is enough to fix it, even if you don't entirely understand the message.
If you do encounter an error you don't recognize, try looking at the official documentation on errors. However, note that you may not always be able to find the error there, as it is possible to create custom errors. In that case, hopefully the custom error message is informative enough to help you figure out what went wrong.
When you forget a colon at the end of a line,
accidentally add one space too many when indenting under an if statement,
or forget a parenthesis,
you will encounter a syntax error.
This means that Python couldn't figure out how to read your program.
This is similar to forgetting punctuation in English:
this text is difficult to read there is no punctuation there is also no capitalization why is this hard because you have to figure out where each sentence ends you also have to figure out where each sentence begins to some extent it might be ambiguous if there should be a sentence break or not
People can typically figure out what is meant by text with no punctuation, but people are much smarter than computers. If Python doesn't know how to read the program, it will just give up and inform you with an error. For example:
In [132]:
def some_function()
    msg = "hello, world!"
    print(msg)
     return msg
Here, Python tells us that there is a SyntaxError
on line 1,
and even puts a little arrow in the place where there is an issue.
In this case the problem is that the function definition is missing a colon at the end.
Actually, the function above has two issues with syntax.
If we fix the problem with the colon,
we see that there is also an IndentationError,
which means that the lines in the function definition do not all have the same indentation:
In [133]:
def some_function():
    msg = "hello, world!"
    print(msg)
     return msg
Both SyntaxError
and IndentationError
indicate a problem with the syntax of your program,
but an IndentationError
is more specific:
it always means that there is a problem with how your code is indented.
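The relationship between the two error types can be checked with the built-in compile function, which parses source text without running it. This is a sketch only: the two source strings are shortened versions of the function above, and the helper name kind_of_syntax_problem is made up for the illustration.

```python
# Two broken versions of the function, kept as strings so that this
# example itself still runs.
missing_colon = (
    'def some_function()\n'
    '    msg = "hello, world!"\n'
    '    return msg\n'
)
extra_space = (
    'def some_function():\n'
    '    msg = "hello, world!"\n'
    '     return msg\n'
)

def kind_of_syntax_problem(source):
    """Return the name of the error raised when compiling `source`, or None."""
    try:
        compile(source, '<example>', 'exec')
    except SyntaxError as err:   # also catches IndentationError, a subclass
        return type(err).__name__
    return None

first = kind_of_syntax_problem(missing_colon)
second = kind_of_syntax_problem(extra_space)

print('missing colon ->', first)
print('extra space   ->', second)
print('IndentationError is a kind of SyntaxError:',
      issubclass(IndentationError, SyntaxError))
```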
Tabs and Spaces
A quick note on indentation errors: they can sometimes be insidious, especially if you are mixing spaces and tabs. Because they are both whitespace, it is difficult to visually tell the difference. The IPython notebook actually gives us a bit of a hint, but not all Python editors will do that. In the following example, the first two lines are using a tab for indentation, while the third line uses four spaces:
def some_function():
	msg = "hello, world!"
	print(msg)
    return msg
File "<ipython-input-5-653b36fbcd41>", line 4
    return msg
              ^
IndentationError: unindent does not match any outer indentation level
By default, one tab is equivalent to eight spaces, so the only way to mix tabs and spaces is to make it look like this. In general, it is better to just never use tabs and always use spaces, because mixing them can make things very confusing.
Another very common type of error is called a NameError,
and occurs when you try to use a variable that does not exist.
For example:
In [134]:
print(a)
Variable name errors come with some of the most informative error messages, which are usually of the form "name 'the_variable_name' is not defined".
Why does this error message occur? That's a harder question to answer, because it depends on what your code is supposed to do. However, there are a few very common reasons why you might have an undefined variable. The first is that you meant to use a string, but forgot to put quotes around it:
In [135]:
print(hello)
The second is that you just forgot to create the variable before using it.
In the following example,
count
should have been defined (e.g., with count = 0
) before the for loop:
In [136]:
for number in range(10):
    count = count + number
print("The count is: " + str(count))
Finally, the third possibility is that you made a typo when you were writing your code.
Let's say we fixed the error above by adding the line Count = 0
before the for loop.
Frustratingly, this actually does not fix the error.
Remember that variables are case-sensitive,
so the variable count
is different from Count. We still get the same error, because we still have not defined count:
In [137]:
Count = 0
for number in range(10):
    count = count + number
print("The count is: " + str(count))
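To watch this happen without stopping the program, the error can be caught and its message inspected. The sketch below is an added illustration; the names message and total are invented for it, and the exact wording of the message may vary between Python versions.

```python
Count = 0   # capital C, as in the example above

try:
    for number in range(10):
        count = count + number   # lowercase `count` was never created
except NameError as err:
    message = str(err)
    print('NameError:', message)

# Defining the name with the same spelling that the loop uses fixes it.
count = 0
for number in range(10):
    count = count + number
total = count

print('The count is: ' + str(total))
```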
Next up are errors having to do with containers (like lists and dictionaries) and the items within them. If you try to access an item in a list or a dictionary that does not exist, then you will get an error. This makes sense: if you asked someone what day they would like to get coffee, and they answered "caturday", you might be a bit annoyed. Python gets similarly annoyed if you try to ask it for an item that doesn't exist:
In [138]:
letters = ['a', 'b', 'c']
print("Letter #1 is " + letters[0])
print("Letter #2 is " + letters[1])
print("Letter #3 is " + letters[2])
print("Letter #4 is " + letters[3])
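Dictionaries behave in a similar way, except that a missing key produces a KeyError rather than an IndexError. The following sketch shows both errors being caught, along with two common ways to check before asking; the dictionary ages and its contents are hypothetical.

```python
letters = ['a', 'b', 'c']
ages = {'alice': 30}   # hypothetical dictionary for the illustration

# Asking a list for a position it does not have raises IndexError.
try:
    letters[3]
except IndexError as err:
    list_error = str(err)

# Asking a dictionary for a key it does not have raises KeyError.
try:
    ages['bob']
except KeyError as err:
    dict_error = type(err).__name__

# Checking first avoids the error altogether.
index = 3
fourth = letters[index] if index < len(letters) else None
bob_age = ages.get('bob')   # get returns None when the key is missing

print(list_error, dict_error, fourth, bob_age)
```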
If you get an error that you've never seen before, searching the Internet for that error type often reveals common reasons why you might get that error.
Much of the content for this notebook was created by Software Carpentry. Here are some additional resources if you wish to continue learning Python: